An Overview of Cancer in the First 315,000 All of Us Participants

Introduction: The NIH All of Us Research Program will have the scale and scope to enable research for a wide range of diseases, including cancer. The program’s focus on diversity and inclusion promises a better understanding of the unequal burden of cancer. Preliminary cancer ascertainment in the All of Us cohort from two data sources (self-reported surveys versus electronic health records (EHR)) is considered.
Materials and methods: This work was performed on data collected from the All of Us Research Program’s 315,297 enrolled participants to date using the Researcher Workbench, where approved researchers can access and analyze All of Us data on cancer and other diseases. Cancer case ascertainment was performed using data from EHR and self-reported surveys across key factors. The distribution of cancer types and the concordance of data sources by cancer site and demographics are analyzed.
Results and discussion: Data collected from 315,297 participants resulted in 13,298 cancer cases detected in the survey (in 89,261 participants), 23,520 cancer cases detected in the EHR (in 203,813 participants), and 7,123 cancer cases detected across both sources (in 62,497 participants). Key differences in survey completion by race/ethnicity impacted the makeup of cohorts when compared to cancer in the EHR and national NCI SEER data.
Conclusions: This study provides key insight into cancer detection in the All of Us Research Program and points to the existing strengths and limitations of All of Us as a platform for cancer research now and in the future.


Introduction
Cancer was the second leading cause of death in the United States, after cardiovascular diseases, in 2020, with over 600,000 cancer-related deaths and a further 1.8 million expected diagnoses [1][2][3][4][5]. Although treatments are improving and personalized medicine promises advancements, cancer diagnoses are expected to increase substantially over the next decade, due mainly to the aging population in the US and modifiable behavioral/lifestyle factors [1,2,6]. The risk of developing cancer depends on the complex interplay of factors including genes, age, and gender; lifestyle and behavioral factors such as diet, energy balance, physical activity, and tobacco and alcohol use; endogenous factors such as hormones and growth factors; medication and drug use; infectious agents; and environmental exposures [1,6]. Precision medicine and precision health, which consider the patient as an individual, hold promise for cancer research [7][8][9][10]. For instance, individuals with similar diagnoses often receive the same treatment despite observations that efficacy varies by patient. Additionally, new approaches to precision prevention and early detection, informed by an enriched understanding of the etiology and natural history of cancer, could improve clinical interventions.
With over one million participants, the All of Us Research Program will have the scale to enable research on myriad diseases, especially cancer [11][12][13]. The program's focus on diversity and inclusion promises to shed light on US cancer inequities, as fewer than 2% of cancer studies have been powered to consider race/ethnicity [14,15]. Given its diversity and large sample size, All of Us may have the statistical power to answer questions about the causes of cancer and drivers of disparities and identify opportunities for precision prevention.
Researchers currently have access to data from over 315K All of Us participants through the Researcher Workbench. Although the program does not target enrollment by health status, the sample to date includes a sufficient number of participants with a history of cancer, prevalent cancers, and incident cancers to enable systematic studies of cancer risk, outcomes, medication effects, and therapeutic approaches across environmental, social, genomic, and economic contexts. This demonstration project examines the distribution and characterization of cancer in All of Us and compares these numbers to expected national rates reported by the Surveillance, Epidemiology, and End Results (SEER) Program [16] and distribution in the US population.

All of Us research projects
The goals, recruitment methods and sites, and scientific rationale for All of Us have been described previously [17]. Demonstration projects were designed to establish the value of the cohort by describing the cohort and replicating previous findings for validation [18]. The work described here was proposed by Consortium members, reviewed and overseen by the program's Science Committee, and was confirmed as meeting criteria for non-human subjects research by the All of Us Institutional Review Board. The initial release of data and tools used in this work was published in 2020 [18].
This work was performed using the All of Us Researcher Workbench, a cloud-based platform where approved researchers can access and analyze All of Us data. At the time of analysis, the All of Us data included survey responses, electronic health records (EHR), and physical measurements (PM). These three types of data are collected either at an All of Us affiliated health care provider organization (HPO) or through a "direct-volunteer" mechanism. HPOs include regional medical centers, federally qualified health centers, and the Veterans Health Administration. HPOs recruit the majority of program participants, mainly persons affiliated with their center. The direct-volunteer route allows those who are not HPO patients to enroll online and visit a designated health clinic, blood bank, laboratory, or health care provider organization to have their PM collected. All three data types (survey, PM, and EHR) were mapped to the Observational Medical Outcomes Partnership (OMOP) common data model v5.2 maintained by the Observational Health Data Sciences and Informatics (OHDSI) collaborative. To protect participant privacy, a series of data transformations was applied. These included data suppression of codes with a high risk of identification, such as military status; generalization of categories, including age, sex at birth, gender identity, sexual orientation, and race; and date shifting by a random number of days (less than one year), implemented consistently across each participant record. Documentation on privacy implementation and creation of the curated data repository (CDR) is available in the All of Us Registered Tier CDR Data Dictionary [19]. The Researcher Workbench currently offers tools with a user interface (UI) built for selecting groups of participants (Cohort Builder) and creating datasets for analysis (Dataset Builder), as well as Workspaces with Jupyter Notebooks (Notebooks) to analyze data.
The Notebooks enable use of saved datasets and direct query using the R and Python 3 programming languages.
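The date-shifting transformation described above can be illustrated with a minimal sketch. It is not the program's implementation: the function names, the backward direction of the shift, and the 1 to 364 day range are assumptions made for illustration. The point it shows is that applying one offset per participant to every record preserves the intervals between that participant's events.

```python
import random
from datetime import date, timedelta


def make_shift_lookup(participant_ids, seed=0, max_days=364):
    """Draw one random offset (1..max_days days) per participant.

    Hypothetical helper: the real offsets and their range are not
    described in the text beyond "less than one year".
    """
    rng = random.Random(seed)
    return {pid: rng.randint(1, max_days) for pid in participant_ids}


def shift_record_dates(records, shifts):
    """Shift every (participant_id, date) record by that participant's offset.

    Shifting backward in time is an assumption of this sketch.
    """
    return [(pid, d - timedelta(days=shifts[pid])) for pid, d in records]


if __name__ == "__main__":
    records = [
        ("p1", date(2019, 3, 1)),   # e.g. a diagnosis date
        ("p1", date(2019, 6, 1)),   # e.g. a later treatment date
        ("p2", date(2020, 1, 15)),
    ]
    shifts = make_shift_lookup({pid for pid, _ in records}, seed=42)
    shifted = shift_record_dates(records, shifts)
    # The gap between p1's two events is unchanged by the shift.
    print(shifted[1][1] - shifted[0][1] == records[1][1] - records[0][1])
```

Because intervals within a participant are preserved, quantities such as time between diagnosis and treatment remain analyzable, while absolute calendar dates are not reliable.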

Study population
Participant-provided information for our analysis was derived from the surveys described above. The full text of these surveys is available in the Survey Explorer found in the All of Us Research Hub, a publicly available website designed to support researchers [20]. The Basics survey elicits demographic information including age, race/ethnicity, education, marital status, household income, and geography. The Lifestyle survey collects tobacco use data. Personal Medical History collects self-reported cancer history, including cancer type(s), life stage at diagnosis, and whether the participant is currently seeing a health care provider and/or receiving cancer treatment. The Basics and Lifestyle surveys are collected at baseline, whereas Personal Medical History is collected during retention efforts 3 months after enrollment.
Time from diagnosis in the EHR was calculated as the current date minus the date of diagnosis, reported in years (mean, SD, and median).
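As a minimal sketch of this calculation, the following computes time from diagnosis in years and the three summary statistics. The function names, the 365.25-day year, and the use of the sample standard deviation are assumptions of the sketch and are not stated in the source.

```python
from datetime import date
from statistics import mean, median, stdev


def years_since(diagnosis_date, reference_date):
    """Elapsed time in years, using a 365.25-day year (an assumption)."""
    return (reference_date - diagnosis_date).days / 365.25


def summarize_time_from_diagnosis(diagnosis_dates, reference_date):
    """Return mean, sample SD, and median of time from diagnosis in years.

    Requires at least two dates, because the sample SD is undefined for one.
    """
    years = [years_since(d, reference_date) for d in diagnosis_dates]
    return {"mean": mean(years), "sd": stdev(years), "median": median(years)}


if __name__ == "__main__":
    # Illustrative dates only; these are not All of Us data.
    dates = [date(2010, 1, 1), date(2015, 1, 1), date(2018, 1, 1)]
    print(summarize_time_from_diagnosis(dates, reference_date=date(2020, 1, 1)))
```

Because dates in the dataset are shifted, the reference date would need to be on the same shifted scale as the diagnosis dates for the result to be meaningful.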

National comparison
We compared the observed frequency of cancer reported in All of Us to the National Cancer Institute's SEER 18 Registries Database, November 2018 submission [21], to analyze cancer frequency overall and by site based on cases diagnosed in 2016 among residents of the areas included in the 18 registries covering approximately 28% of the United States population. We reported the frequency of diagnosis in 2016 by assessing the limited duration 26-year cancer prevalence to determine the relative frequency and percent contribution of each cancer type to all cancers in the population by evaluating prevalence data representing the first invasive tumor site. Limited-Duration Prevalence represents the proportion of people alive on a certain day who had a diagnosis of the disease within the past x years (e.g., x = 5, 10, or 20 years). We chose the most recent year of diagnosis given the period for which All of Us has been conducting enrollment. Skin cancer (melanoma of the skin) was excluded from the "total cancer" calculation for SEER cancers and from the analysis since the All of Us survey data does not differentiate between melanoma and non-melanoma skin cancer. Invasive cancer was coded using the International Classification of Diseases for Oncology, third edition (ICD-O-3) [22].
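The definition of limited-duration prevalence given above can be made concrete with a toy calculation. This is only a sketch of the definition, not the method used by SEER*Stat, which estimates prevalence from registry data. The record layout (`death_date`, `first_dx_date`) and function names are hypothetical, and everyone in the list is assumed to have been born before the index date.

```python
from datetime import date


def years_before(d, years):
    """Return the date `years` years before `d`, handling Feb 29."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:  # Feb 29 mapped into a non-leap year
        return d.replace(year=d.year - years, day=28)


def limited_duration_prevalence(people, index_date, x_years):
    """Proportion of people alive on index_date diagnosed in the past x years.

    `people` is a list of dicts with 'death_date' and 'first_dx_date',
    each a date or None (hypothetical layout for illustration).
    """
    window_start = years_before(index_date, x_years)
    alive = [p for p in people
             if p["death_date"] is None or p["death_date"] > index_date]
    cases = [p for p in alive
             if p["first_dx_date"] is not None
             and window_start <= p["first_dx_date"] <= index_date]
    return len(cases) / len(alive)


if __name__ == "__main__":
    people = [
        {"death_date": None, "first_dx_date": date(2014, 6, 1)},
        {"death_date": None, "first_dx_date": date(2005, 6, 1)},
        {"death_date": None, "first_dx_date": None},
        {"death_date": date(2015, 1, 1), "first_dx_date": date(2013, 1, 1)},
    ]
    for x in (5, 20):
        print(x, limited_duration_prevalence(people, date(2016, 1, 1), x))
```

Lengthening the window can only add cases, which is why a 26-year prevalence captures long-term survivors that a 5-year prevalence would miss.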

Data analysis
We generated descriptive statistics and prevalence for the most common cancers and used chi-square tests to test the difference in the categorical distribution of data source types (survey data, EHR, and both) across the key demographic and lifestyle categories. The percent distribution of cancer types was calculated as the number of cases per site divided by the total number of cancer cases in each respective dataset. Results are stratified by race/ethnicity and sex at birth to consider the demographic-specific distributions in cancer types. Cancer frequency was calculated using SEER*Stat 8.3.9 [23].

Results and discussion
Table 1 shows the distribution of the baseline characteristics of all participants (N = 315,297), and by those with a cancer outcome as captured from the EHR (N = 203,813 participants with EHR; including N = 23,520 cancer cases), via self-report in the survey database (N = 89,261 completed Personal Medical History survey; including N = 13,298 cancer cases), and from participants with both survey and EHR data (N = 62,497 participants with both data types; including N = 7,123 cancer cases). Personal Medical History survey completion varies considerably, with older, female, and non-Hispanic White participants more likely to provide data than the population with available EHR (which more closely reflects the larger All of Us participant population). Differences across key demographic factors in data availability (survey data and/or EHR) are reflected in the distribution of cancer from the different data sources. Specifically, 84.8% of cancers from the Personal Medical History survey were reported by non-Hispanic Whites, 5.0% by Blacks, and 4.7% by Hispanics compared to 67.1%, 14.3%, and 12.2% respectively captured from the EHR. Non-Hispanic Whites are overwhelmingly represented among those with both self-report and EHR data (75.8%) compared to 51.5% representation in the overall All of Us study population.
All p-values for the chi-square tests comparing the distributions are <0.001 except the comparison of EHR versus total (p = 0.002).
Table 2 shows that All of Us participants' EHR data indicate a history of breast cancer most frequently (N = 6,474; 27.5% of cases) followed by blood cancers (N = 4,841; 20.6%) and prostate cancer (N = 3,971; 16.9%). This mirrors the most common self-reported cancers (from the survey) for breast cancer (N = 4,062; 30.5%) and prostate cancer (N = 2,165; 16.3%) but not for blood cancer (N = 483; 9.9%). There are N = 2,499 individuals with breast cancer documented in both the survey and EHR data sources, followed by N = 1,304 individuals with prostate cancer and N = 657 with blood cancer. Prevalence is broken down by cancer site, showing the difference in contribution to disease burden by data source.
Table 3 presents cancer type distribution from each data source by race and ethnicity, with N = 6,125 cancer cases detected in both data sources for non-Hispanic Whites compared to N = 328 cancer cases in African Americans and N = 294 cancer cases in Hispanics. Differences in the distribution of cancer types between survey data and EHR are observed by race/ethnicity, both within and between race/ethnicity groups, comparing non-Hispanic Whites, Blacks, and Hispanics (p<0.001). The prevalence of cancer also varies by race/ethnicity in each data source.
Table 4 compares the distribution of cancer sites from All of Us survey data and EHR to the expected distribution nationally, based on recent SEER reports of the 26-year limited duration prevalence in 2018. The most common cancer types in SEER (based on contribution to total cancers) are breast cancer (19.9%), prostate cancer (17.6%), blood cancers (11.4%), and colorectal cancers (8.4%).
The percent contribution to the cancer burden nationally (as illustrated by SEER data) from each cancer site differs significantly from the EHR site distribution (p<0.001) and the self-reported distribution (p<0.001). As expected for a cohort enrolled largely through medical centers, All of Us participants have a higher proportion of prevalent cancer (11.54% in EHR and 14.90% in survey) than the US population reported by SEER (4.43%). (Table note: SEER data are based on 5-year prevalence frequency counts of the first invasive tumor.)
Table 5 presents a description of the time from cancer diagnosis as reported in the EHR and survey database. The cancer with the shortest time from diagnosis in the EHR is lung cancer (mean = 5.85 years; SD = 4.46), and the longest time from diagnosis is for head and neck cancer (mean = 11.75 years; SD = 6.59). Across all cancer types, the most common period of diagnosis was adult, followed by older adult.
Table 6 presents treatment types for cancer overall and by site. The most common treatment from the EHR is radiation (N = 7,422; 31.56%), followed by surgery (N = 5,975; 25.54%), hormone therapy (N = 3,962; 16.84%), chemotherapy (N = 842; 3.58%), immunotherapy (N = 470; 1.2%), and stem cell transplant (N = 127; 0.54%). Treatment type utilization varied by cancer site.
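The percent distribution and the chi-square comparison described under Data analysis can be sketched as follows. The breast and prostate counts and the source totals are taken from the text; collapsing the remaining sites into an "all other" category is an assumption made to keep the example small. The sketch also treats the EHR and survey cases as independent samples, which ignores participants present in both sources, so it illustrates the arithmetic rather than reproducing the reported tests. A p-value would come from a chi-square distribution function in a statistics library, which is omitted here.

```python
def percent_distribution(counts):
    """Percent contribution of each site to the total cases in one dataset."""
    total = sum(counts.values())
    return {site: 100.0 * n / total for site, n in counts.items()}


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Counts from the text; "all other" = source total minus breast and prostate.
ehr = {"breast": 6474, "prostate": 3971, "all other": 23520 - 6474 - 3971}
survey = {"breast": 4062, "prostate": 2165, "all other": 13298 - 4062 - 2165}

if __name__ == "__main__":
    for name, counts in (("EHR", ehr), ("survey", survey)):
        pct = percent_distribution(counts)
        print(name, {site: round(p, 1) for site, p in pct.items()})
    stat, dof = chi_square_statistic([list(ehr.values()), list(survey.values())])
    print("chi-square statistic:", stat, "degrees of freedom:", dof)
```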

Conclusions
In this preliminary analysis of data from the All of Us Research Program, we report that the first 315K+ persons comprise a diverse population with a large number of prevalent cancer cases. As the goal of this effort is to inform studies on a variety of health conditions, including cancer, and to delineate information on risk factors and treatments, an early evaluation of cancers represented in the study population is warranted. Our findings have some key implications for cancer prevention, control, treatment, and outcomes research in the All of Us study population. Our most notable finding is simple: although a diverse cohort is being enrolled, self-reported cancers are not being ascertained as frequently through the survey modules among underrepresented participants. As validation of diagnosis from EHR using manual verification or self-report is the gold standard to ensure accurate classification and minimize measurement error, the difference in valid case ascertainment by key factors like race is relevant for All of Us cancer research. The drop in cancer data detected from the survey or validated with survey data is associated with racial/ethnic differences in longitudinal retention. Although surveys are completed by a relatively older population, age does not appear to be a key factor influencing differences in data collection. History of cancer is collected through a survey completed at least 90 days after enrollment in All of Us, with an overall medical history survey completion rate across the program of 22% among participants underrepresented in biomedical research (UBR) compared to 42% among non-UBR participants. Factors previously noted in the literature [24] that could be relevant to differences in retention by race/ethnicity include language, literacy, cultural appropriateness, flexibility, ongoing incentives, and communication; of growing importance, as survey data collection becomes increasingly electronic, is the digital divide.
This has research implications for the cancer history data collected at follow-up as well as other key risk factor information including health care utilization, personal medical history, and family history. Our investigation shows that the impact of these factors on cancer disparities will be underreported even if cancer history can be obtained from the EHR of most underrepresented participants. All of Us leadership has changed the survey module timeline and made Personal Medical History available at baseline, addressing some of the limitations noted here for prospective enrollees. Furthermore, the difference in cancer ascertainment between survey modules and EHR modalities in underrepresented participants highlights the importance of technologies to integrate the medical records of direct volunteers. Tools such as Sync for Science for obtaining EHRs from direct volunteers, or non-digital methods of collecting survey data, could offer utility beyond providing medical record information for direct volunteers, as there are implications for inclusion and equity in the investigation of all diseases, including cancer.
The distribution of cancer sites between the two data sources when compared to SEER national statistics is impacted by exclusion of skin cancer from the All of Us cancer analyses. Skin cancer cases account for approximately half of the total cases reported in the survey data. These cases likely include both malignant and non-malignant skin cancers, which would introduce significantly different relative proportions of other cancers if included in the analysis. As restriction to malignant cases was not possible, we excluded all skin cancer cases from analysis.
Another point to consider is the grouping of blood cancers. Because the survey module asks about blood cancers generically, it is impossible to differentiate between myeloma, lymphoma, and leukemia in survey responses. This distinction can be deciphered from the EHR when available. The ability to distinguish these types will be crucial to many cancer researchers.
We further report on the time from diagnosis and the life stage at diagnosis to consider opportunities to collect incident cases or investigate hypotheses for more recent diagnoses. The utility of the life stage questions in etiology or outcomes research is unclear, as the age groups in the survey (child (0-11); adolescent (12-17); adult (18-64); older adult (65-74); and elderly (75+)) are quite broad. A more refined or consistent metric, such as date of cancer diagnosis, would aid investigation of various cancer-related hypotheses (such as being able to stratify by pre- and post-menopausal breast cancer). Presenting these data side by side highlights how distinct these metrics of diagnosis timing really are.
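A short sketch shows why the life stage categories are coarse for hypothesis testing. It maps an age at diagnosis to the survey's life stage groups; the function name is hypothetical, and the adult range of 18 to 64 is an assumption inferred from the neighboring groups (adolescent ending at 17 and older adult starting at 65).

```python
def life_stage(age_at_diagnosis):
    """Map age in years to the survey life stage groups.

    Assumed boundaries: child 0-11, adolescent 12-17, adult 18-64
    (inferred), older adult 65-74, elderly 75+.
    """
    if age_at_diagnosis < 0:
        raise ValueError("age must be non-negative")
    if age_at_diagnosis <= 11:
        return "child"
    if age_at_diagnosis <= 17:
        return "adolescent"
    if age_at_diagnosis <= 64:
        return "adult"
    if age_at_diagnosis <= 74:
        return "older adult"
    return "elderly"


if __name__ == "__main__":
    # A diagnosis at 40 and one at 60 fall into the same category, so a
    # pre- versus post-menopausal split cannot be recovered from life stage.
    print(life_stage(40), life_stage(60))
```

An exact diagnosis date, where available, avoids this loss of resolution because any age-based stratification can be derived from it afterward.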
The All of Us Research Program is set to become one of the largest scientific efforts in U.S. history, and its emphasis on inclusion presents key opportunities to advance precision health and medicine and address disparities in research [25]. Despite the limitations noted in this report, this unprecedented depth of inclusion will confer an important resource for cancer research. All of Us was conceived to support studies of disease outcomes, medication effects, and other therapeutic approaches across various environmental, social, genomic, and economic contexts [26]. The scale and scope of its current cancer data will support extensive investigation of cancer-related hypotheses and enhance the pace of discovery and generalizability. The cohort's expansion to 1 million participants will create further opportunities. Furthermore, feedback from demonstration projects such as this one will directly inform edits to existing surveys and development of reassessment modules.
In summary, the All of Us Research Program has collected significant cancer data from its first 315K participants. This preliminary investigation notes the most common cancers that will confer sufficient study power for research, especially once whole genome data is available for all participants. Given our findings, the program might consider the implications of lower retention through survey completion among underrepresented participants for the resource's utility for research on cancer and other diseases.